System and method for teaching compositionality to convolutional neural networks

ABSTRACT

A system for teaching compositionality to convolutional neural networks includes an unmasked convolutional neural network comprising a first set of convolutional neural network layers; a first masked convolutional neural network comprising a second set of convolutional neural network layers; the unmasked convolutional neural network and the first masked convolutional network sharing convolutional neural network weights; the system training the unmasked and first masked convolutional neural networks simultaneously based on an objective function that seeks to reduce both discriminative loss and compositional loss.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/221,104, filed 14 Dec. 2018, which is a continuation of U.S. patent application Ser. No. 15/803,595, filed 3 Nov. 2017, which claims the benefit of U.S. Provisional Application Ser. No. 62/417,168, filed on 3 Nov. 2016, each of which is incorporated in its entirety by this reference.

TECHNICAL FIELD

This invention relates generally to the artificial intelligence field, and more specifically to new and useful systems and methods for teaching compositionality to convolutional neural networks.

BACKGROUND

Despite advances in computer vision, image processing, and machine learning, recognizing visual objects remains a task where computers fail in comparison with the capabilities of humans. Recognizing an object from an image not only requires recognizing the object in a scene but also recognizing objects in various positions, in different settings, and with slight variations. For example, to recognize a chair, the innate properties that make a chair a chair must be understood. This is a simple task for a human. Computers struggle to deal with the vast variety of types of chairs and the situations in which a chair may be present. Models capable of performing visual object recognition must be trained to provide explanations for visual datasets in order to recognize objects present in those visual datasets. Unfortunately, most methods for training such models fall short in performance, require large training sets, or both.

This issue is not confined solely to visual object recognition, but more generally applies to pattern recognition, which may be used in speech recognition, natural language processing, and other fields. Thus, there is a need in the artificial intelligence field to create new and useful systems and methods for teaching compositionality to convolutional neural networks.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1A is an example diagram view of a convolutional neural network;

FIG. 1B is an example diagram view of weights of a convolutional neural network;

FIG. 2A is a diagram view of a system of an invention embodiment;

FIG. 2B is a diagram view of a system of an invention embodiment;

FIG. 3A is an example view of a first penalty mask generation scheme;

FIG. 3B is an example view of a second penalty mask generation scheme; and

FIG. 4 is a chart view of a method of an invention embodiment.

DESCRIPTION OF THE INVENTION EMBODIMENTS

The following description of the invention embodiments is not intended to limit the invention to these invention embodiments, but rather to enable any person skilled in the art to make and use this invention.

1. Neural Networks and Related Systems

Neural networks and related systems, including recursive cortical networks (RCNs), convolutional neural networks (CNNs), hierarchical compositional networks (HCNs), HMAX models, Slow Feature Analysis (SFA) systems, and Hierarchical Temporal Memory (HTM) systems may be used for a wide variety of tasks that are difficult to complete using standard rule-based programming. These tasks include many in the important fields of computer vision and speech recognition.

Neural networks and related systems can be represented as distributed processing elements that implement summation, multiplication, exponentiation or other functions on the elements' incoming messages/signals. Such networks can be enabled and implemented through a variety of implementations. For example, a system may be implemented as a network of electronically coupled functional node components. The functional node components can be logical gates arranged or configured in a processor to perform a specified function. As a second example, the system may be implemented as a network model programmed or configured to be operative on a processor. The network model is preferably electronically stored software that encodes the operation and communication between nodes of the network. Neural networks and related systems may be used in a wide variety of applications and can use a wide variety of data types as input such as images, video, audio, natural language text, analytics data, widely distributed sensor data, or other suitable forms of data.

In particular, convolutional neural networks (CNNs) may be useful for performing inference on data for which feature recognition is independent of one or more dimensions of the data; for example, when detecting shapes in an image, the detected shapes are not dependent on their position in the image—the same features used to detect a square in one part of the image may be used to detect a square in another part of the image as well. These dimensions may be spatial (as in the 2D image example), but may additionally or alternatively be temporal or any suitable dimensions (e.g., a frequency dimension for audio or multispectral light data).

CNNs, as shown in FIG. 1A, take advantage of the dimensional structure of data by connecting neurons only to a small region of input data, selected using the previously mentioned dimensionality of the data. These regions are referred to as local receptive fields. A set of neurons, each neuron connected to a receptive field of neurons below it, as well as the output of that set of neurons, is referred to as a feature map. In a CNN, feature maps are unique in that they have shared weights; that is, each neuron of a given feature map (which corresponds to the same feature, but shifted across the data dimensions) is constrained to have the same input weights and biases as the other neurons of that feature map, as shown in FIG. 1B. Of course, it is often desirable for a given layer to have multiple feature maps (each able to correspond to a different feature). The layer containing the multiple feature maps is referred to as a convolutional layer (so called because the output of a neuron in the layer is a convolution operation performed on the inputs).
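The weight sharing described above can be illustrated with a minimal sketch, assuming a one-dimensional input and a single feature map. The function name `feature_map` and the numeric weights are hypothetical and chosen only for illustration; they are not taken from the figures.

```python
import numpy as np

def feature_map(inputs, weights, bias):
    """One 1D feature map: every neuron applies the same weights and bias
    to its own local receptive field of len(weights) inputs."""
    k = len(weights)
    return np.array([
        np.dot(inputs[i:i + k], weights) + bias
        for i in range(len(inputs) - k + 1)
    ])

# Hypothetical shared weights for a receptive field of 3.
weights = np.array([1.0, -2.0, 1.0])
bias = 0.0

# The same small pattern placed at two different positions in the input.
pattern = np.array([0.0, 1.0, 0.0])
signal_a = np.zeros(10)
signal_a[2:5] = pattern
signal_b = np.zeros(10)
signal_b[6:9] = pattern

map_a = feature_map(signal_a, weights, bias)
map_b = feature_map(signal_b, weights, bias)

# Because the weights are shared, the response to the pattern is the same
# in both maps, only shifted by the same four positions as the pattern.
same_response_shifted = np.allclose(np.roll(map_a, 4), map_b)
```

In this sketch the feature detector does not need to be relearned for each position, which is the property the shared-weight constraint is meant to provide.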

CNNs may also include pooling layers, which function to reduce the size of the output of a set of neurons (typically the output of a convolutional layer, but pooling layers may be used for any set of neurons; e.g., on top of the input neurons). For example, a pooling layer may take the maximum activation of a set of neurons as an output (i.e., max-pooling). Pooling layers are applied to each feature map separately. Commonly, pooling layers are used between convolutional layers in CNNs. CNNs also may include other layers, such as input layers, output layers, etc.

As shown in FIG. 1A, an example CNN includes an input layer IL, a first convolutional layer CL1, a pooling layer PL, a second convolutional layer CL2, and an output layer OL. The input neurons in the input layer are connected to two feature maps in CL1 (note that while two sets of input neurons are shown, this is for display purposes only—the neurons of the first feature map and of the second feature map are connected to the same input neurons). As shown, CL1 has a receptive field of 3.

Each feature map is in turn connected to a set of pooling neurons in PL. As shown, PL has a pooling window of 2.

The output of PL is used as an input to the second convolution layer CL2, which has a receptive field of 2. Note here that each neuron of CL2 connects to each feature map of CL1/PL; in other words, the feature map of CL2 (there is only one as shown in FIG. 1A) is a combination of all feature maps in CL1. This is typical of CNNs; generally, in a CNN, all feature detectors in a convolutional layer have access to all features from the previous convolutional layer.

Finally, the output of CL2 is used as input to the output layer OL. Note here that OL is fully connected to CL2.
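As a rough illustration of the example network of FIG. 1A, the following sketch runs a forward pass through IL, CL1, PL, CL2, and OL using the stated receptive fields (3 and 2) and pooling window (2). The input length, the ReLU nonlinearity, the non-overlapping pooling stride, the number of output neurons, and the random weights are all assumptions made for the sketch and are not specified in the description.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_layer(x, w, b):
    """x: (channels_in, length); w: (channels_out, channels_in, field).
    Each output feature map combines all input feature maps."""
    c_out, c_in, field = w.shape
    out_len = x.shape[1] - field + 1
    out = np.empty((c_out, out_len))
    for pos in range(out_len):
        window = x[:, pos:pos + field]  # local receptive field
        out[:, pos] = np.tensordot(w, window, axes=([1, 2], [0, 1])) + b
    return np.maximum(out, 0.0)  # ReLU nonlinearity (assumption)

def max_pool(x, window):
    """Non-overlapping max-pooling applied to each feature map separately."""
    c, length = x.shape
    usable = (length // window) * window
    return x[:, :usable].reshape(c, usable // window, window).max(axis=2)

x = rng.normal(size=(1, 8))            # IL: one input channel, 8 neurons (assumed size)
w1 = rng.normal(size=(2, 1, 3))        # CL1: two feature maps, receptive field of 3
b1 = np.zeros(2)
w2 = rng.normal(size=(1, 2, 2))        # CL2: one feature map over both CL1/PL maps
b2 = np.zeros(1)

cl1 = conv_layer(x, w1, b1)            # shape (2, 6)
pl = max_pool(cl1, window=2)           # shape (2, 3)
cl2 = conv_layer(pl, w2, b2)           # shape (1, 2)

w_out = rng.normal(size=(3, cl2.size)) # OL: fully connected to CL2 (3 outputs assumed)
b_out = np.zeros(3)
ol = w_out @ cl2.ravel() + b_out       # shape (3,)
```

Note that `w2` has a channel dimension of 2, which is how the single CL2 feature map gains access to all features of the previous convolutional layer.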

By limiting neural network connections via exploiting the locality of the receptive fields according to data dimensionality, CNNs can perform inference with a fraction of the complexity required by an older fully-connected model.

Nevertheless, typical CNNs fail to exhibit significant compositionality; i.e., the preference that the representation of part of a dataset should be similar to the corresponding part of the representation of the dataset. For example, a neural network performing object recognition on image data ideally recognizes objects regardless of their context (e.g., other objects in the image, background). In typical CNNs, this is not the case—the activations corresponding to the region surrounding a first object are significantly affected by changes in background or the presence of nearby objects.

The system and method of an invention embodiment function to teach compositionality to CNNs (resulting in enhanced-compositionality CNNs). The system and method do so not by explicitly modifying the structure of the CNNs, but rather by using unique and novel techniques to modify CNN training objectives and therefore bake compositionality into the CNNs during training. This is distinct from past approaches designed to create compositional neural networks, which generally mandate a particular network architecture or parameterization.

As described previously, compositionality is the preference that the representation of part of a dataset should be similar to the corresponding part of the representation of the dataset. A formal definition of compositionality for an image dataset is as follows.

Let X be an image, m a binary mask that identifies part of X (i.e., m is a tensor of the same shape as X with 1s indicating part affiliation), φ a mapping from an image onto an arbitrary feature layer of a CNN, and p the projection operator onto the feature map represented by φ. Then, we define φ to be compositional if the following equality holds:

φ(m·X)=p(m)·φ(X)

Here, the · operator represents element-wise multiplication. The projection operator, p, down-samples the object mask to the size of the output of φ. For instance, if φ(X) is the activations of a convolutional layer and has size (x, y, c) (where the first two dimensions are spatial and c is the number of feature channels), p will down-sample the object mask to size (x, y) and then stack c copies of the down-sized object mask on top of each other to produce a mask of size (x, y, c).
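The projection operator and the compositionality equality can be sketched as follows. This is an illustration under stated assumptions: the mask is down-sampled by taking the maximum over non-overlapping blocks (the description does not fix a down-sampling rule), the mask dimensions are integer multiples of the output dimensions, and the two mappings `phi_local` and `phi_context` are hypothetical stand-ins for a CNN feature layer.

```python
import numpy as np

def block_max(a, fy, fx):
    """Down-sample a 2D array by taking the max over fy-by-fx blocks."""
    h, w = a.shape
    return a.reshape(h // fy, fy, w // fx, fx).max(axis=(1, 3))

def project_mask(mask, out_shape):
    """p(m): down-size the mask to (x, y), then stack c copies -> (x, y, c)."""
    x, y, c = out_shape
    small = block_max(mask, mask.shape[0] // x, mask.shape[1] // y)
    return np.repeat(small[:, :, None], c, axis=2)

CHANNEL_WEIGHTS = np.array([1.0, 0.5, 2.0])  # hypothetical, c = 3

def phi_local(image):
    """Hypothetical mapping whose outputs depend only on their own 2x2 block."""
    pooled = block_max(np.maximum(image, 0.0), 2, 2)
    return pooled[:, :, None] * CHANNEL_WEIGHTS

def box_blur(image):
    """3x3 mean filter with zero padding; mixes each pixel with its neighbors."""
    h, w = image.shape
    padded = np.pad(image, 1)
    return sum(padded[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def phi_context(image):
    """Hypothetical mapping whose outputs also depend on surrounding context."""
    return phi_local(box_blur(image))

def compositionality_gap(phi, image, mask):
    """Norm of phi(m * X) - p(m) * phi(X); zero when the equality holds."""
    features = phi(image)
    return np.linalg.norm(phi(mask * image) - project_mask(mask, features.shape) * features)

X = np.arange(64, dtype=float).reshape(8, 8) / 63.0
m = np.zeros((8, 8))
m[2:6, 2:6] = 1.0  # part of X treated as a unit

gap_local = compositionality_gap(phi_local, X, m)      # equality holds
gap_context = compositionality_gap(phi_context, X, m)  # equality is violated
```

In this sketch the context-dependent mapping violates the equality because activations near the mask boundary change when the surrounding image content is removed, which mirrors the behavior attributed to typical CNNs above.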

Note that we do not require the preceding equation to hold for all possible masks m, as this would constrain φ to be the identity map. Instead, we apply the inductive bias selectively to image parts that we would like to be treated as a unit—obvious choices include objects, object parts, or image regions that carry certain attributes.

2. System for Teaching Compositionality to CNNs

A system 100 for teaching compositionality to CNNs includes an unmasked CNN 110, a masked CNN 120, an object mask 130, and a penalty mask 131, as shown in FIGS. 2A and 2B.

The system 100 functions to train a CNN (the unmasked CNN 110) for both discrimination and compositionality using one or more masked CNNs 120. The masked CNNs 120 are similar to the unmasked CNN 110 (and in fact share weights with the CNN 110), but include an object mask 130 at their inputs (and potentially at intermediate locations as well). The object mask 130 for a given masked CNN 120 masks input to filter out data except that which corresponds to a particular object (e.g., by selecting a region containing the object). Penalty masks 131 function similarly, but are used for calculating compositional loss (i.e., activations of unmasked networks may be masked for comparison to outputs of masked networks, but this masking is used only for such comparison and does not directly affect the activations transmitted from layer to layer in the unmasked network 110).

The system 100 preferably trains the unmasked CNN 110 using an objective function that attempts to minimize (or otherwise reduce) both discriminative loss and compositional loss. Such training is discussed in greater detail in the section on the method 200, which the system 100 preferably uses to accomplish CNN training. The system 100 may additionally or alternatively be trained in any manner (e.g., via stochastic gradient descent). Training data may be any set of data; e.g., images, video, audio, speech, medical sensor data, natural language data, financial data, application data, traffic data, environmental data, etc.

Note that the system 100 may include a single unmasked CNN 110 and a single masked CNN 120 (as in FIG. 2A) or a single unmasked CNN 110 and multiple masked CNNs 120 (as in FIG. 2B). For example, the system 100 may include a number of masked CNNs 120 equivalent to the number of objects in a set of training data.

The unmasked CNN 110 is a CNN without any masks intermediate to its layers (in other words, a standard CNN), with the exception that the unmasked CNN 110 may include connections at one or more layers to penalty masks 131.

The unmasked CNN 110 may be any type of CNN. For example, the unmasked CNN 110 may include alternating convolutional and pooling layers (eventually coupled to an output layer), as in FIG. 2A. Additionally or alternatively, the unmasked CNN 110 may include any number of CNN layers of any type and in any order.

The masked CNN 120 is preferably substantially similar to the unmasked CNN 110, except that input to the masked CNN 120 is masked by an object mask 130 (as shown in FIGS. 2A and 2B). The masked CNN 120 may include any number of object masks 130. For example, as in FIG. 2B, a masked CNN 120 may include object masks between layers of the masked CNN 120 (which prevents a layer of the CNN from erroneously propagating activations that occur outside of a region defined by the object mask 130). Note that as shown in FIG. 2B each displayed layer of the CNNs 110 and 120 (referenced as Am) includes both pooling and convolutional layers (while the OL layers represent fully connected output layers).

Note that the unmasked CNN 110 and the masked CNNs 120 preferably share weights, constraining the training process.

The object mask 130 functions to filter the input to and/or output of layers of the masked CNNs 120. The object mask 130 is preferably a binary mask applied to a region (e.g., a spatial region, a temporal region, a spatiotemporal region, or any other way of examining a region within a dataset) containing a particular object. For example, if an image includes a cup sitting on a table, the region containing the cup may contain 1s (i.e., allowing activations in that region to pass) while the region containing the table and the background of the image (and any other objects) may contain zeros (i.e., blocking activations in that region). While the object mask 130 is referred to as a mask (due to potential applications in image categorization) it is understood that the object mask 130 may be any filter that suppresses activations outside of a region containing an object.
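A minimal sketch of a binary object mask and its application follows. It assumes the object region is given as a rectangular box and that the masked data is an image of shape (height, width, channels); the helper names `make_object_mask` and `apply_mask` and the box coordinates are hypothetical.

```python
import numpy as np

def make_object_mask(shape, top, left, height, width):
    """Binary mask with 1s inside the (assumed rectangular) object region."""
    mask = np.zeros(shape, dtype=np.float64)
    mask[top:top + height, left:left + width] = 1.0
    return mask

def apply_mask(activations, mask):
    """Pass activations inside the region and block them everywhere else."""
    return activations * mask[:, :, None]

# Hypothetical 6x8 RGB image with a "cup" occupying rows 1-3, columns 2-5.
image = np.random.default_rng(0).uniform(size=(6, 8, 3))
cup_mask = make_object_mask(image.shape[:2], top=1, left=2, height=3, width=4)
masked_image = apply_mask(image, cup_mask)
```

The same `apply_mask` step could in principle be reused between layers of a masked CNN, provided the mask is first resized to the spatial size of the layer output.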

The object mask 130 is preferably present at the input to masked CNNs 120, but may additionally or alternatively be present in any location (e.g., between layers of masked CNNs 120).

In one implementation of an invention embodiment, the penalty mask 131 is preferably substantially similar to the object mask 130, except that the penalty mask 131 is used to filter output of layers of the unmasked CNN 110 (such filtering enabling the comparison of layer output from the unmasked CNN 110 to that of masked CNNs 120). In this implementation, shifts in activation in the region of the object for a given penalty mask 131/object mask 130 pair are penalized, but background activations are not (because they are filtered out by the penalty mask 131). An example of this implementation is as shown in FIG. 3A.

In a second implementation of an invention embodiment, the penalty mask 131 is similar to the object mask 130, except that activations in the background (i.e., area of the dataset where no object regions are defined) are passed by the penalty mask 131 in addition to activations in the region of a given object. In this implementation, both positional shifts in activation in the object region and background activations are penalized. An example of this implementation is as shown in FIG. 3B.
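The two penalty mask generation schemes can be sketched as follows, assuming the object masks are stacked in a single array of shape (K, height, width) and that the background is the set of locations covered by no object mask. The function names are hypothetical, and resizing a penalty mask to the size of a given layer output is omitted here.

```python
import numpy as np

def penalty_mask_object_only(object_masks, k):
    """First scheme: pass only the region of object k."""
    return object_masks[k].copy()

def penalty_mask_with_background(object_masks, k):
    """Second scheme: pass the region of object k plus the background,
    where background means locations covered by no object mask."""
    union = np.clip(object_masks.sum(axis=0), 0.0, 1.0)
    background = 1.0 - union
    return np.clip(object_masks[k] + background, 0.0, 1.0)

# Two hypothetical objects on a 4x6 grid.
object_masks = np.zeros((2, 4, 6))
object_masks[0, 0:2, 0:2] = 1.0
object_masks[1, 2:4, 3:6] = 1.0

scheme_a = penalty_mask_object_only(object_masks, 0)
scheme_b = penalty_mask_with_background(object_masks, 0)
```

Under both schemes in this sketch, the region of the other object is blocked, so activations of the unmasked network caused by that object are excluded from the comparison for object 0.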

3. Method for Teaching Compositionality to CNNs

A method 200 for teaching compositionality to CNNs includes receiving an initial masked CNN structure S210, receiving a training dataset S220, and learning a set of CNN parameters S230, as shown in FIG. 4.

The method 200 functions to set tunable parameters of a CNN such that the CNN is trained to perform data inference based on a set of data used to train the CNN (i.e., the training dataset received in S220).

The method 200 is preferably implemented on the system 100, but may additionally or alternatively be implemented by any neural network capable of implementing the steps of the method 200. The method 200 is preferably implemented by a computing system (e.g., computer, distributed computing system, etc.).

S210 includes receiving an initial masked CNN structure. S210 preferably includes receiving information describing the structure of an unmasked CNN—e.g., data that specifies the neurons of the unmasked CNN and their connections (e.g., layer type and order, connections between layers, etc.). This information may be specified in a number of forms; for example, CNN structure may be specified by specifying each variable node and factor node and their connections. S210 preferably additionally includes receiving information describing masked CNNs (corresponding to the unmasked CNNs) as well as object and penalty masks (e.g., location, mask definition, etc.).

Alternatively, CNN structure may be specified relying on known structural rules (e.g., a two-layer CNN, each layer containing a convolutional sub-layer and a pooling sub-layer, connections specified by stated pooling windows and receptive fields). CNN structure information may include any CNN structural or parametric information described in the section on the system 100 as well as any additional information that may be used in the course of the method 200.

S210 may additionally include receiving hyperparameters of the CNNs (discussed in S230).

S220 includes receiving a training dataset. S220 functions to receive a set of training data (henceforth referred to as X). The set of training data preferably includes multiple elements (e.g., $\{X_{n}\}_{n=1}^{N}$); for example, each element may correspond to a different training image of an image dataset. Training data may additionally include corresponding classifying information; for example, a dataset may include a set of labels C: $\{X_{n}, C_{n}\}_{n=1}^{N}$.

Accordingly, training data may be unlabeled, partially labeled, or fully labeled. Likewise, training data may be complete (e.g., information is provided for each input neuron of the CNN) or incomplete (e.g., information is not provided for all input neurons).

Training data may be any set of data for which inference or generation is desired; e.g., images, video, audio, speech, medical sensor data, natural language data, financial data, application data, traffic data, environmental data, etc.

S230 includes learning a set of CNN parameters. S230 functions to learn values for tunable parameters of the CNN based on the training dataset (allowing the CNN to perform inference for data objects similar to those the CNN is trained on). These tunable parameters are preferably chosen by attempting to minimize an objective function that takes into account both discriminative loss and compositional loss. Additionally or alternatively, the tunable parameters may be chosen in any manner.

Tunable parameters are preferably set by attempting to minimize the objective function using a stochastic gradient descent algorithm, but may additionally or alternatively be set in any manner.
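The following sketch illustrates gradient-based minimization with weights shared between a masked and an unmasked path. It is a toy under stated assumptions, not the training procedure itself: the network is a single 1D filter of length 3, the discriminative terms are binary log-losses on the mean activation for a positive label, the penalty mask is the object mask cropped to the feature-map length, the gradient is estimated by finite differences, and a single training example is used (so each step is plain gradient descent; a stochastic variant would draw a different example or mini-batch at each step).

```python
import numpy as np

def features(x, w):
    """Feature map of a single shared filter w (valid cross-correlation)."""
    return np.correlate(x, w, mode="valid")

def softplus(z):
    return np.logaddexp(0.0, z)

def objective(w, x, mask, gamma=0.5, lam=1.0):
    phi_u = features(x, w)          # unmasked path
    phi_m = features(mask * x, w)   # masked path, same weights w
    penalty = mask[1:-1]            # mask cropped to feature-map length (assumes len(w) == 3)
    loss_u = softplus(-phi_u.mean())  # assumed binary log-loss, positive label
    loss_m = softplus(-phi_m.mean())
    discriminative = gamma * loss_m + (1.0 - gamma) * loss_u
    compositional = lam * np.sum((phi_m - phi_u * penalty) ** 2)
    return discriminative + compositional

def numerical_gradient(f, w, eps=1e-5):
    """Central finite-difference estimate of the gradient of f at w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad

x = np.sin(np.arange(10.0))   # hypothetical training input
mask = np.zeros(10)
mask[3:7] = 1.0               # hypothetical object region

w = np.array([0.5, -0.3, 0.8])  # shared weights, arbitrary initial values
loss_fn = lambda weights: objective(weights, x, mask)
initial_loss = loss_fn(w)
for _ in range(100):
    w = w - 0.05 * numerical_gradient(loss_fn, w)  # one update affects both paths
final_loss = loss_fn(w)
```

Because both paths read the same `w`, each update reflects the discriminative terms of both paths and the compositional term at once; a practical implementation would presumably use automatic differentiation rather than finite differences.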

In one example implementation of an invention embodiment, S230 includes attempting to minimize the following objective function:

$\mathcal{L} = \mathcal{L}_{d} + \mathcal{L}_{c}$

where $\mathcal{L}_{d}$, the discriminative loss function, is defined as

$\mathcal{L}_{d} = {{\frac{1}{K}{\sum\limits_{k}{\gamma L_{mk}}}} + {\left( {1 - \gamma} \right)L_{u}}}$

where K is the number of masked CNNs (and k is the index of masks/masked CNNs), L_(mk) corresponding to a discriminative loss term for each masked CNN, L_(u) corresponding to a discriminative loss term for the unmasked CNN, and γ is a hyperparameter between 0 and 1; and where

$\mathcal{L}_{c}$, the compositional loss, is defined as

$\mathcal{L}_{c} = \frac{1}{K}\sum\limits_{k}\sum\limits_{n}\lambda_{n}\left\| \phi_{m_{k},n} - \phi_{u,n}\,m_{k}^{\prime} \right\|_{2}^{2}$

where n is a layer index, λ_(n) is a layer-specific hyperparameter, ϕ_(m_k,n) is the activation of the kth masked CNN (corresponding to object mask m_(k)) at layer n, ϕ_(u,n) is the activation of the unmasked CNN at layer n, and m′_(k) is the penalty mask of the unmasked CNN corresponding to the object mask m_(k). Additionally or alternatively, the masked/unmasked CNNs may be trained in any manner that increases CNN compositionality. Note that the discriminative loss functions may be instantiated as softmax-cross entropy for joint class prediction or sigmoid-cross entropy for independent class prediction (or in any other manner).
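The objective function above can be sketched in code as follows. The sketch assumes that activations are supplied as nested lists indexed by mask k and layer n, and that each penalty mask has already been resized to the shape of the corresponding layer output (or is broadcastable to it). All function and variable names are hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Joint class prediction: exactly one correct class index."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def sigmoid_cross_entropy(logits, targets):
    """Independent class prediction: one 0/1 target per class."""
    return np.sum(np.maximum(logits, 0.0) - logits * targets
                  + np.log1p(np.exp(-np.abs(logits))))

def discriminative_loss(masked_losses, unmasked_loss, gamma):
    """(1/K) * sum_k gamma * L_mk + (1 - gamma) * L_u."""
    return gamma * np.mean(masked_losses) + (1.0 - gamma) * unmasked_loss

def compositional_loss(masked_acts, unmasked_acts, penalty_masks, lambdas):
    """(1/K) * sum_k sum_n lambda_n * ||phi_mk_n - phi_u_n * m'_k||^2.
    masked_acts[k][n] and penalty_masks[k][n] are indexed by mask k and layer n."""
    num_masks = len(masked_acts)
    total = 0.0
    for k in range(num_masks):
        for n, lam in enumerate(lambdas):
            diff = masked_acts[k][n] - unmasked_acts[n] * penalty_masks[k][n]
            total += lam * np.sum(diff ** 2)
    return total / num_masks

# Hypothetical activations for two layers and K = 2 masks.
rng = np.random.default_rng(0)
unmasked_acts = [rng.normal(size=(4, 4, 3)), rng.normal(size=(2, 2, 5))]

def spatial_mask(size, rows, cols):
    mask = np.zeros((size, size, 1))  # trailing axis broadcasts over channels
    mask[rows, cols] = 1.0
    return mask

penalty_masks = [
    [spatial_mask(4, slice(0, 2), slice(0, 2)), spatial_mask(2, slice(0, 1), slice(0, 1))],
    [spatial_mask(4, slice(2, 4), slice(2, 4)), spatial_mask(2, slice(1, 2), slice(1, 2))],
]
lambdas = [0.5, 2.0]

# A perfectly compositional network would produce these masked activations.
ideal_masked = [[u * m for u, m in zip(unmasked_acts, masks)] for masks in penalty_masks]
loss_c_ideal = compositional_loss(ideal_masked, unmasked_acts, penalty_masks, lambdas)

loss_d = discriminative_loss(
    masked_losses=[softmax_cross_entropy(np.zeros(3), 0),
                   softmax_cross_entropy(np.zeros(3), 1)],
    unmasked_loss=sigmoid_cross_entropy(np.zeros(3), np.array([1.0, 1.0, 0.0])),
    gamma=0.5,
)
total_loss = loss_d + loss_c_ideal
```

In this sketch the compositional term vanishes exactly when each masked activation equals the penalty-masked activation of the unmasked network, which is the condition the training objective rewards.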

The methods of the preferred embodiment and variations thereof can be embodied and/or implemented at least in part as a machine configured to receive a computer-readable medium storing computer-readable instructions. The instructions are preferably executed by computer-executable components preferably integrated with a convolutional neural network operating on a computer system. The computer-readable medium can be stored on any suitable computer-readable media such as RAMs, ROMs, flash memory, EEPROMs, optical devices (CD or DVD), hard drives, floppy drives, or any suitable device. The computer-executable component is preferably a general or application specific processor, but any suitable dedicated hardware or hardware/firmware combination device can alternatively or additionally execute the instructions.

As a person skilled in the art will recognize from the previous detailed description and from the figures and claims, modifications and changes can be made to the preferred embodiments of the invention without departing from the scope of this invention defined in the following claims. 

We claim:
 1. A system comprising a non-transitory computer readable medium having stored thereon software instructions that, when executed by a processing system, cause the processing system to train a first and second set of convolutional neural network layers by: determining a masked input, comprising masking a training input with a first mask; with the masked input and using the second set of convolutional neural network layers, determining a first result; with the training input and using the first set of convolutional neural network layers: determining intermediate outputs of a layer of the first set; and based on the intermediate outputs, determining a second result; determining compositional loss based on the intermediate outputs masked with a second mask; determining discriminative loss based on the first and second results; and training both the first and second sets of convolutional neural network layers based on both the discriminative loss and the compositional loss.
 2. The system of claim 1, wherein the first and second sets of convolutional neural network layers share convolutional neural network weights.
 3. The system of claim 1, wherein the first mask is associated with the second mask.
 4. The system of claim 3, wherein the first mask and the second mask suppress activation of identical regions.
 5. The system of claim 1, wherein the first mask suppresses a first region of a training input; wherein the second mask suppresses activation of at least a subset of the first region.
 6. The system of claim 1, wherein the first mask comprises an object mask and the second mask comprises a second object mask.
 7. The system of claim 1, wherein the first and second sets of convolutional neural network layers are trained based on an objective function that seeks to reduce both discriminative loss and compositional loss.
 8. The system of claim 7, wherein the objective function models discriminative loss using softmax-cross entropy for joint class prediction.
 9. The system of claim 7, wherein the objective function models discriminative loss using sigmoid-cross entropy for independent class prediction.
 10. The system of claim 1, wherein the first and second sets of convolutional neural network layers share all convolutional neural network weights.
 11. The system of claim 1, wherein training both the first and second sets of convolutional neural network layers comprises simultaneously updating weights of the first and second sets using a gradient descent algorithm.
 12. The system of claim 1, wherein determining the second result comprises determining corresponding intermediate outputs of a corresponding layer of the second set of convolutional neural network layers, wherein the corresponding layer of the second set corresponds to the first layer of the first set, wherein determining the compositional loss comprises: determining masked intermediate outputs with the intermediate outputs using the second mask; and comparing the masked intermediate outputs with the corresponding intermediate outputs of the corresponding layer of the second set.
 13. The system of claim 1, wherein the compositional loss is further determined based on a second set of intermediate outputs for an other layer of the first set.
 14. A method for a convolutional neural network comprising: determining a masked input, comprising masking a training input with a first mask; with the training input and using a first set of convolutional neural network layers, determining intermediate outputs of a layer of the first set of convolutional neural network layers; determining compositional loss based on the intermediate outputs and a second mask; determining discriminative loss from respective outputs of the first set of convolutional network layers and a second set of convolutional neural network layers based on the training input and masked input, respectively; and updating convolutional neural network weights of both the first and second sets of convolutional neural network layers based on the compositional loss and the discriminative loss; wherein: the first and second sets of convolutional neural network layers share convolutional neural network weights.
 15. The method of claim 14, wherein the second mask suppresses activation of a subset of the first region.
 16. The method of claim 14, wherein the first mask comprises an object mask and the second mask comprises a second object mask.
 17. The method of claim 16, wherein the first mask and the second mask suppress activation of identical objects.
 18. The method of claim 14, wherein the first and second sets of convolutional neural network layers are trained based on an objective function that seeks to reduce both discriminative loss and compositional loss.
 19. The method of claim 14, wherein the first and second sets of convolutional neural network layers share all convolutional neural network weights.
 20. The method of claim 14, further comprising determining an additional set of intermediate outputs for the other layer of the first set of convolutional neural network layers, wherein determining the compositional loss comprises: masking the intermediate outputs and the additional intermediate outputs.